In this notebook we use the following R packages: tidyverse and ggplot2. In addition we use the multiplot functionality provided by Cookbook for R. In Python we use fastai, numpy and pandas. Full code for training and inference available at GitHub. All code for statistical analysis is included in this document.
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.
In my master’s thesis we apply recent insights connecting dropout neural networks to Bayesian approximate variational inference (VI). VI is technique for approximating intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et. al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior (a brief description of this process is listed in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so called MC dropout NNs.
The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will apply MC dropout in a classification task based on the collection of labelled images in the CIFAR-10 data set.
A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units.
In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).
Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly “dropping out” nonoutput units, temporarily removing all connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the dropout rate in order to match the expected output during training.
Recent work by Gal et. al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.
Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:
Mathematically, model uncertainy is approximated the empirical standard deviation of the predictions for class \(k\), i.e. \[\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T{[p_t(k|X,w) - \hat{\mu}_k}]^2}{T-1}}\] where \[\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)\] is the averaged softmax outputs of the predicted class.
Gal et. al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a Bayesian neural network with a Bernoulli approximating prior over the parameters. Gal has written an excellent blog post that introduces the work and the derivation.
The derivation in Gal et. al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this to dropout neural networks is the main result of Gal et. al. [2].
In [1] however, Gal et. al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks still can be “modelled as Bayesian”.
As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network’s ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. We will examine if there is any additional information to be gained from establishing a connection between the prediction and the runner-up prediction. In other words, our problem statement is:
Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?
State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5.
LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCunn in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but still captures the gist of what a convolutional network is while remaining simple enough to allow us to understand every building block of the network. The following chunk shows the model architecture:
class lenet_all(nn.Module):
def __init__(self, conv_size=conv_size, pool_size=2, drop_rate=p):
super().__init__()
self.drop_rate = drop_rate
self.conv1 = nn.Conv2d(in_channels=3, out_channels=192, kernel_size=conv_size)
self.dropmc1 = DropoutMC(p)
self.pool1 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
self.conv2 = nn.Conv2d(in_channels=192, out_channels=192, kernel_size=conv_size, padding=2)
self.dropmc2 = DropoutMC(p)
self.pool2 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
self.dense1 = nn.Linear(in_features=7*7*192, out_features=1000)
self.dropmc3 = DropoutMC(p)
self.dense2 = nn.Linear(in_features=1000, out_features=10)
def forward(self, x):
x = self.dropmc1(self.conv1(x))
x = self.pool1(x)
x = self.dropmc2(self.conv2(x))
x = self.pool2(x)
x = x.view(x.size(0), -1)
x = self.dropmc3(F.relu(self.dense1(x)))
x = self.dense2(x)
return x
The above specification of LeNet-5 differs from the orginial in one crucial way: We use Monte Carlo dropout layers. MC dropout is not a feature that is implemented in PyTorch, and we must therefore implemenet one ourselves. Fortunately, this amounts to a simple adjustment of existing code. We modify the Dropout class to take an additional argument called dropoutMC with default value set to True. :
class DropoutMC(nn.Module):
r"""
Modified version of Dropout from torch/nn/modules/dropout.py
Args:
p: probability of an element to be zeroed. Default: 0.5
dropoutMC: If set to ``True``, dropout is turned on at test time. Default: ``True`
inplace: If set to ``True``, will do this operation in-place. Default: ``False``
Shape:
- Input: `Any`. Input can be of any shape
- Output: `Same`. Output is of the same shape as input
Examples::
>>> m = nn.Dropout(p=0.2)
>>> input = autograd.Variable(torch.randn(20, 16))
>>> output = m(input)
.. _Improving neural networks by preventing co-adaptation of feature
detectors: https://arxiv.org/abs/1207.0580
"""
def __init__(self, p=0.5, dropoutMC=True, inplace=False):
super(DropoutMC, self).__init__()
if p < 0 or p > 1:
raise ValueError("dropout probability has to be between 0 and 1, "
"but got {}".format(p))
self.p = p
self.dropoutMC = dropoutMC
self.inplace = inplace
def forward(self, input):
return F.dropout(input, self.p, self.dropoutMC, self.inplace)
def __repr__(self):
inplace_str = ', inplace' if self.inplace else ''
return self.__class__.__name__ + '(' \
+ 'p=' + str(self.p) \
+ inplace_str + ')'
We need to define a function that performs inference over out input. The following chunk contains the relevant code for the sampling procedure described in section 1.1:
def inference(learner, data, T=100):
''' Function that gathers all relevant numerical results from MC dropout over T iterations.
Arguments:
learner, fastai learner object
data, fastai dataloader
T, number of stochastic forward passes
'''
# Get images, labels and filenames
imgs, labels = next(iter(data.val_dl))
fnames = data.val_ds.fnames
# Empty dictionary to store all output
output = {}
# Empty array to store results in
results = np.empty((T, num_classes))
# iterator index to keep in dictionary
k=0
for (img, label, fname) in list(zip(imgs, labels, fnames)):
for i in range(T):
prediction = learner.predict_array(img[None])
results[i] = prediction
probs = to_np(F.softmax(V(results)))
probs_mean = np.mean(probs, axis=0)
pred_std = np.std(probs, axis=0)
prediction = probs_mean.argmax()
uncertainty = pred_std[prediction]
correct = 1 if prediction == label else 0
output[k] = {"img": fname, "softmax_dist": probs, "probs": probs_mean, "prediction": prediction, "truth": label, "uncertainty": uncertainty, "correct": correct}
k+=1
return output
The function inference stores all the relevant statistics and softmax distributions in a dictionary named output. The results are then turned into a pandas dataframe and some very basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R. The relevant code can be found on GitHub.
We will examine data gathered from four variants of LeNet-5. All models were trained on CIFAR-10 using the fastai API. CIFAR-10 contains 60.000 labelled 32x32x3 color images belonging to 10 different classes. The input data was split into a training set of 50.000 images and a test set of 10.000 images. The training set was further split into a training set and a validation set. All models have weight_decay = 0.0005 and all learning rates were chosen using lr_find:
model55: Trained for 60 epochs with a learning rate of 0.001 and kernel size = (5,5) and drop_rate = .5. 0.71280 validation loss at end of training with an accuracy of 0.76137 on the validation data. model55 represents the baseline implementation of LeNet-5. It is identical in structure to the one used by Gal et. al. [1].
model52: Trained for 14 epochs with a learning rate of 0.01 for the first 7 and 0.0001 on the remaining. Changed due to rapid overfitting. Model has kernel size = (5,5) and drop_rate = .2. 0.74186 validation loss at end of training with an accuracy of 0.74406 on the validation data.
model35: Trained for 66 epochs with a learning rate of 0.001 and kernel size = (3,3) and drop_rate = .5. 0.75483 validation loss at end of training with an accuracy of 0.74218 on the validation data.
model32: Trained for 52 epochs with a learning rate of 0.001 and kernel size = (3,3) and drop_rate = .2. 0.75449 validation loss at end of training with an accuracy of 0.74090 on the validation data.
Note that model55 has a slightly better baseline performance than the other models.
The data contains the following variables after it has been prepared for analysis in R:
correct (logical): indicator the is TRUE if the predicted class label matches the true class label, else FALSE.
prediction (int): predicted class label.
truth (int): true class label.
uncertainty (dbl): empirical standard deviation of softmax values for predicted class.
prob1 (dbl): argmax of mean softmax output, i.e. mean probability of predicted class.
prob2 (dbl): mean probability of runner-up prediction.
class2 (int): class label of runner-up prediction.
logit_prob1 (dbl): logit transformation of prob1.
diff (dbl): prob1-prob2
diff_sd_ratio (dbl): diff/uncertainty.
PUT THIS IN AFTER PLOT, EDIT IF NECESSARY Intuitively, if diff is large, the averaged models all agree that class \(k\) is the correct prediction. If diff is small, the models sampled by MC dropout don’t agree on a single class. Thus diff also serves as a proxy for uncertainty. Model uncertainty, however, is approximated by the empirical standard deviation of the predictions for class \(k\). Thus diff_sd_ratio is expressed by \[\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}\] where \(j\) is the runner-up prediction. \(\tau_{jk}\) gives us a ratio of two different measures of uncertainty. Consider the following cases:
\(\tau_{kj} \approx 1\) means that diff and uncertainty are relatively similar. This happens if a) the models have failed to reach a consensus (diff is small) but model uncertainty is low, or b) the models have reached a consensus (diff is large) but model uncertainty is high. Let’s call these risky predictions.
\(\tau_{kj} \to 0\) means that uncertainty is much larger than diff. This happens if the models have failed to reach a consensus and model uncertainty is high. These should represent uncertain predictions.
\(\tau_{kj} \to \infty\) means that uncertainty is much smaller than diff, e.g. the models have reached a consensus and model uncertainty is low. These should represent certain predictions.
This section will be divided into to parts. First, we will examine the resulting data from model55, which will be regarded as our baseline model. Next, we will analyse the data from models obtained by varying kernel sizes and dropout rates.
# Importing data
data <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/lenet-model55.csv"))
df <- select(data, -X)
# Summarizing entire data set
summary(df)
## correct prediction truth uncertainty
## Min. :0.0000 Min. :0.000 Min. :0.0 Min. :0.0000029
## 1st Qu.:1.0000 1st Qu.:2.000 1st Qu.:2.0 1st Qu.:0.0400673
## Median :1.0000 Median :5.000 Median :4.5 Median :0.1274580
## Mean :0.7916 Mean :4.561 Mean :4.5 Mean :0.1184386
## 3rd Qu.:1.0000 3rd Qu.:7.000 3rd Qu.:7.0 3rd Qu.:0.1831702
## Max. :1.0000 Max. :9.000 Max. :9.0 Max. :0.3526330
## prob1 prob2 class2 diff
## Min. :0.1835 Min. :0.0000004 Min. :0.000 Min. :0.0000124
## 1st Qu.:0.5814 1st Qu.:0.0188524 1st Qu.:2.000 1st Qu.:0.3340403
## Median :0.8281 Median :0.1021314 Median :4.000 Median :0.7213734
## Mean :0.7657 Mean :0.1343779 Mean :4.082 Mean :0.6312929
## 3rd Qu.:0.9700 3rd Qu.:0.2257584 3rd Qu.:6.000 3rd Qu.:0.9503560
## Max. :1.0000 Max. :0.4989194 Max. :9.000 Max. :0.9999990
## diff_sd_ratio logit_prob1
## Min. : 0.0 Min. :-1.4929
## 1st Qu.: 1.7 1st Qu.: 0.3284
## Median : 5.4 Median : 1.5722
## Mean : 208.5 Mean : 2.1424
## 3rd Qu.: 23.5 3rd Qu.: 3.4750
## Max. :340800.1 Max. :14.3329
# Summarizing incorrect predictions
df %>%
filter(correct==0) %>%
summary()
## correct prediction truth uncertainty
## Min. :0 Min. :0.000 Min. :0.000 Min. :0.001013
## 1st Qu.:0 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:0.153040
## Median :0 Median :4.000 Median :4.000 Median :0.181794
## Mean :0 Mean :4.342 Mean :4.049 Mean :0.180244
## 3rd Qu.:0 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.:0.210907
## Max. :0 Max. :9.000 Max. :9.000 Max. :0.333903
## prob1 prob2 class2 diff
## Min. :0.1835 Min. :0.000185 Min. :0.000 Min. :0.0000124
## 1st Qu.:0.4103 1st Qu.:0.173676 1st Qu.:2.000 1st Qu.:0.0959377
## Median :0.5264 Median :0.239908 Median :4.000 Median :0.2403054
## Mean :0.5451 Mean :0.242866 Mean :4.038 Mean :0.3022316
## 3rd Qu.:0.6602 3rd Qu.:0.312481 3rd Qu.:6.000 3rd Qu.:0.4645013
## Max. :0.9994 Max. :0.498919 Max. :9.000 Max. :0.9992357
## diff_sd_ratio logit_prob1
## Min. : 0.0001 Min. :-1.4929
## 1st Qu.: 0.5130 1st Qu.:-0.3629
## Median : 1.2146 Median : 0.1055
## Mean : 2.8113 Mean : 0.2501
## 3rd Qu.: 2.4979 3rd Qu.: 0.6642
## Max. :986.4482 Max. : 7.4530
# Summarizing correct predictions
df %>%
filter(correct==1) %>%
summary()
## correct prediction truth uncertainty
## Min. :1 Min. :0.000 Min. :0.000 Min. :0.0000029
## 1st Qu.:1 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:0.0263485
## Median :1 Median :5.000 Median :5.000 Median :0.0965002
## Mean :1 Mean :4.619 Mean :4.619 Mean :0.1021673
## 3rd Qu.:1 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:0.1672548
## Max. :1 Max. :9.000 Max. :9.000 Max. :0.3526330
## prob1 prob2 class2 diff
## Min. :0.1863 Min. :0.0000004 Min. :0.000 Min. :0.0007845
## 1st Qu.:0.7009 1st Qu.:0.0114910 1st Qu.:2.000 1st Qu.:0.5144494
## Median :0.9002 Median :0.0618456 Median :4.000 Median :0.8352473
## Mean :0.8237 Mean :0.1058168 Mean :4.094 Mean :0.7179230
## 3rd Qu.:0.9820 3rd Qu.:0.1758222 3rd Qu.:7.000 3rd Qu.:0.9697941
## Max. :1.0000 Max. :0.4886857 Max. :9.000 Max. :0.9999990
## diff_sd_ratio logit_prob1
## Min. : 0.0 Min. :-1.4742
## 1st Qu.: 2.8 1st Qu.: 0.8516
## Median : 8.6 Median : 2.1997
## Mean : 262.7 Mean : 2.6406
## 3rd Qu.: 36.9 3rd Qu.: 3.9965
## Max. :340800.1 Max. :14.3329
# Aggregating summary statistics by correct/incorrect
agg_df <- df %>%
group_by(correct) %>%
summarise(n=n(),
mean_uncertainty=mean(uncertainty),
sd_uncertainty=sd(uncertainty),
mean_diff=mean(diff),
sd_diff=sd(diff),
mean_ratio=mean(diff_sd_ratio),
sd_ratio=sd(diff_sd_ratio))
agg_df
The model has incorrectly classified 2084 images. The mean uncertainty estimate for the incorrect predictions is \(\sigma_{incorrect} =\) 0.1802444. The standard deviation of the uncertainty estimate is 0.0461661.
The model has correctly classified 7916 images. The mean estimated uncertainty is \(\sigma_{correct} =\) 0.1021673. The standard deviation of the uncertainty estimates in the correctly classified cases is 0.0780871.
In the following we will visualize the relationships between our variabels. We start by examining the empirical distribution of the uncertainty estimates \(\hat{\sigma}_k\):
# Distribution of estimated uncertainty
p1 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_histogram(col="grey", bins = 50, alpha=.5) +
ggtitle("Distribution of estimated uncertainty")
p1
The distribution appears to be bimodal. By grouping the uncertainty estimates by
correct (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution:
#Distribution of estimated uncertainty by prediction
p2 <- df %>%
ggplot(aes(x=uncertainty, col=as.factor(correct))) +
geom_freqpoly(alpha=.7) +
ggtitle("Distribution of estimated uncertainty by classification") +
scale_color_discrete(name="Prediction",
breaks=c("0", "1"),
labels=c("0: Incorrect", "1: Correct"))
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications. The following kernel density estimate plot (using a Gaussian kernel) gives us an idea of how the distributions compare to eachother:
# KDE by correct prediction
p3 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_density(data=subset(df, correct==0), fill="red", alpha=I(.2)) +
geom_density(data=subset(df, correct==1), fill="turquoise", alpha=I(.2)) +
ggtitle("Kernel density estimates of uncertainty distribution")
p3
The boxplot gives us yet another way to visualize the difference between uncertainty distributions by predictive ability:
# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>%
ggplot(aes(x=as.factor(correct), y=uncertainty)) +
geom_boxplot(aes(fill=as.factor(correct)), alpha=.7) +
labs(x="correct") +
ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
scale_fill_discrete(name="Prediction",
breaks=c("0", "1"),
labels=c("0: Incorrect", "1: Correct"))
p4
We may also be interested in the relationship between uncertainty and other variables. First, we plot uncertainty against prob1 (the prediction’s softmax output):
# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(alpha=.2) +
ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
labs(x="softmax output of predicted class") +
scale_color_distiller(name="Softmax output",
palette = "Spectral")
p5
The softmax outputs in the above plot are colour graded. Outputs close to 1 are red, outputs close to 0 are blue. The plot shows a clear parabolic shape. We can obtain more information by colouring the points by the value of the runner-up predictions:
# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p5 <- df %>%
ggplot(aes(x=prob1, y=uncertainty, col=prob2)) +
geom_point(alpha=.5) +
ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
labs(x="softmax output of predicted class") +
scale_color_distiller(name="Runner up",
palette = "Spectral")
p5
This plot is particularly interesting. The softmax outputs of the runner-up classes \(\hat{\mu}_j\) in the above plot are colour graded. \(\hat{\mu}_j \approx 0.5\) are red, \(\hat{\mu}_j \approx 0\) are blue. We see a clear concentration of red points in the area where the probability of the predictied class \(\hat{\mu}_k \approx 0.5\). If the softmax predictions of both the predicted class and the runner-up are close 0.5, then we have a situation analogous to maximum entropy. This points coincide with the largest approximated uncertainty values.
In the following plot we split the observations by incorrect/correct predictions:
# Plotting uncertainty vs. softmax output of prediction
p6 <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(aes(col=prob2),alpha=0.5) +
geom_density_2d(col="black", alpha=.3) +
ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
labs(x="softmax output of prediction") +
facet_grid(.~as.factor(correct)) +
scale_color_distiller(name="Runner up",
palette = "Spectral")
p6
The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. This is not surprising, considering 75% of the correct predictions have a softmax value of approximately 0.7 or above. For the incorrect classifications, most of the points are concentrated around the area of maximum entropy. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.
The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a LOESS estimate of the mean uncertainty as a function of the runner-up output to make this clearer:
# Plotting softmax output of runner-up against estimated uncertainty
p7 <- df %>%
ggplot(aes(x=prob2, y=uncertainty)) +
geom_point(alpha=.2) +
geom_smooth(method="loess") +
labs(x="softmax output of runner-up") +
ggtitle("Uncertainty vs. softmax output of runner-up")
p7
Let us take a look at the five most uncertain, incorrectly classified images:
For the most ucertain images the runner-up suggestion is non-sensical. But what would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?
# Checking if runner-up is equal to groung truth
df %>%
filter(correct==0) %>%
mutate(correct=if_else(class2==truth,1,0)) %>%
count(correct)
Accuracy would rise to 0.9079. This indicates that there may be some valuable information to be gathered from the runner-up.
In the following we look at the five most certain, incorrectly classified images:
Logistic regression gives us a simple way of testing if the approximated uncertainty is a useful predictor of the model’s ability to predict correctly.
# Fitting model on full data set to check significance of predictor
model <- glm(as.factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model)
##
## Call:
## glm(formula = as.factor(correct) ~ uncertainty, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6704 0.2409 0.3588 0.7100 2.0114
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.55249 0.07625 46.59 <2e-16 ***
## uncertainty -15.40803 0.43250 -35.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10236.6 on 9999 degrees of freedom
## Residual deviance: 8467.6 on 9998 degrees of freedom
## AIC: 8471.6
##
## Number of Fisher Scoring iterations: 5
We gather the data from all models in a single dataframe df:
ROOT <- "~/Documents/Masteroppgave/Data/Resultater/"
# Kernel size 3, drop rate = .2
df32 <- process(ROOT, "lenet-model32.csv", "model32", 3, .2)
# Kernel size 3, drop rate = .5
df35 <- process(ROOT, "lenet-model35.csv", "model35", 3, .5)
# Kernel size 5, drop rate = .5
df55 <- process(ROOT, "lenet-model55.csv", "model55", 5, .5)
# Kernel size 5, drop rate = .2
df52 <- process(ROOT, "lenet-model52.csv", "model52", 5, .2)
# Combining all dataframes and adding kernel size, dropout rate
df <- df32 %>%
bind_rows(df35) %>%
bind_rows(df52) %>%
bind_rows(df55) %>%
mutate(correct=as.factor(as.logical(correct)),
kernel=as.factor(as.character(kernel)),
dropout=as.factor(as.character(dropout)))
head(df)
summary(df)
## correct prediction truth uncertainty
## FALSE: 9160 Min. :0.000 Min. :0.0 Min. :0.0000013
## TRUE :30840 1st Qu.:2.000 1st Qu.:2.0 1st Qu.:0.0367498
## Median :5.000 Median :4.5 Median :0.1041035
## Mean :4.568 Mean :4.5 Mean :0.1025274
## 3rd Qu.:7.000 3rd Qu.:7.0 3rd Qu.:0.1551265
## Max. :9.000 Max. :9.0 Max. :0.3526330
## prob1 prob2 class2 diff
## Min. :0.1694 Min. :0.0000004 Min. :0.000 Min. :0.0000124
## 1st Qu.:0.5685 1st Qu.:0.0234855 1st Qu.:2.000 1st Qu.:0.3084196
## Median :0.8069 Median :0.1119699 Median :4.000 Median :0.6873797
## Mean :0.7530 Mean :0.1398601 Mean :4.098 Mean :0.6131886
## 3rd Qu.:0.9624 3rd Qu.:0.2325295 3rd Qu.:6.000 3rd Qu.:0.9374409
## Max. :1.0000 Max. :0.4989194 Max. :9.000 Max. :0.9999992
## diff_sd_ratio logit_prob1 model kernel
## Min. : 0.0 Min. :-1.5899 Length:40000 3:20000
## 1st Qu.: 1.9 1st Qu.: 0.2757 Class :character 5:20000
## Median : 5.8 Median : 1.4300 Mode :character
## Mean : 242.5 Mean : 1.9877
## 3rd Qu.: 25.3 3rd Qu.: 3.2416
## Max. :798986.9 Max. :15.0261
## dropout
## 0.2:20000
## 0.5:20000
##
##
##
##
Note that we have added two new variables to the data set: kernel (factor, convolution size) and dropout (factor, dropout rate). In the following chunk we generate some summary statistics:
# Summary stats for all models
total_uncertainty <- df %>%
group_by(model) %>%
summarise(accuracy=sum(as.numeric(correct)-1)/10000,
mean_uncertainty=mean(uncertainty),
sd_uncertainty=sd(uncertainty),
mean_diff=mean(diff),
sd_diff=sd(diff),
mean_ratio=mean(diff_sd_ratio),
sd_ratio=sd(diff_sd_ratio)) %>%
dplyr::arrange(mean_uncertainty)
total_uncertainty
Overall:
model52 is the most certain.
model35 is the least certain.
It seems as though models trained with a higher dropout rate are more accurate, but less certain.
The mean differences between the predicted and runner-up class are close to 0.6 for all models.
The models with the lowest mean uncertainty correspond with large mean values of \(\tau_{jk}\).
model52: Accuracy has increased by 0.01514 percent.model55: Accuracy has increased by 0.03023 percent.model32: Accuracy has increased by 0.0167 percent.model35: Accuracy has increased by 0.03342 percent.# Aggregating summary statistics by model and correct/incorrect
agg_df <- df %>%
group_by(model, correct) %>%
summarise(p=n()/10000,
mean_uncertainty=mean(uncertainty),
sd_uncertainty=sd(uncertainty),
mean_diff=mean(diff),
sd_diff=sd(diff),
mean_ratio=mean(diff_sd_ratio),
sd_ratio=sd(diff_sd_ratio))
# Sorting by uncertainty: most certain to least certain
ordered_df <- agg_df %>%
dplyr::arrange(mean_uncertainty)
ordered_df
# Distributions of uncertainty by model (left) and by model/classification (right)
p1 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_histogram(alpha=.5, fill="blue", bins=50) +
facet_grid(as.factor(model)~.) +
ggtitle("Distributions of uncertainty")
p2 <- df %>%
ggplot(aes(x=uncertainty, col=correct)) +
geom_freqpoly(alpha=.7) +
ggtitle("Distribution of estimated uncertainty by classification") +
scale_color_discrete(name="Prediction",
breaks=c("0", "1"),
labels=c("0: Incorrect", "1: Correct")) +
facet_grid(as.factor(model)~.)
layout <- matrix(c(1,2), nrow=1)
multiplot(p1, p2, layout=layout)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# KDEs
p3 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_density(aes(fill=correct), alpha=.7) +
facet_grid(as.factor(model)~correct) +
ggtitle("Kernel density estimations of uncertainty by classification")
p3
# Boxplots of uncertainty by model and classification
df %>%
ggplot(aes(x=correct, y=uncertainty)) +
geom_boxplot(aes(fill=correct), alpha=.7) +
facet_grid(.~as.factor(model))
# Boxplots of log-transformed tau by model and classification
df %>%
ggplot(aes(x=correct, y=log(diff_sd_ratio))) +
geom_boxplot(aes(fill=correct), alpha=.7) +
facet_grid(.~as.factor(model))
# Plotting uncertainty against softmax of predicted class, with density estimation contours
df %>%
ggplot(aes(x=uncertainty, y=prob1, col=prob1)) +
geom_point(alpha=.5) +
geom_density2d(col="black", alpha=.5) +
labs(y="softmax output of predicted class") +
scale_color_distiller(name="",
palette = "Spectral") +
facet_grid(~as.factor(model)~correct)
# Plotting log-transformed tau against uncertainty
df %>%
ggplot(aes(x=uncertainty, y=log(diff_sd_ratio), col=log(diff_sd_ratio))) +
geom_point(alpha=.2) +
#geom_density2d(col="black", alpha=.5) +
labs(y="log(tau)") +
facet_grid(~as.factor(model)~correct)
[1] Gal, Y. & Ghahramani, Z. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv: 1506.02158 (2015).
[2] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv: 1506.02142 (2015).
[3] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Appendix. arXiv: 1506.02157 (2015).
[4] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports volume 7, Article number: 17816 (2017).
[5] Hinton, G. & Srivastava, N. et. al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580 (2012).
[6] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection: supplementary material. Scientific Reports volume 7, Article number: 17816 (2017).